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Abstract 

Variable selection for recovering sparsity in nonadditive nonparametric models has 
been challenging. This problem becomes even more difficult due to complications in 
modeling unknown interaction terms among high dimensional variables. There is cur- 
rently no variable selection method to overcome these limitations. Hence, in this paper 
we propose a variable selection approach that is developed by connecting a kernel 
machine with the nonparametric multiple regression model. The advantages of our ap- 
proach are that it can: (1) recover the sparsity, (2) automatically model unknown and 
complicated interactions, (3) connect with several existing approaches including linear 
nonnegative garrote, kernel learning and automatic relevant determinants (ARD), and 
(4) provide flexibility for both additive and nonadditive nonparametric models. Our 
approach may be viewed as a nonlinear version of a nonnegative garrote method. We 
model the smoothing function by a least squares kernel machine and construct the non- 
negative garrote objective function as the function of the similarity matrix. Since the 
multiple regression similarity matrix can be written as an additive form of univariate 
similarity matrices corresponding to input variables, applying a sparse scale parameter 
on each univariate similarity matrix can reveal its relevance to the response variable. 
We also derive the asymptotic properties of our approach, and show that it provides a 
square root consistent estimator of the scale parameters. Furthermore, we prove that 
sparsistency is satisfied with consistent initial kernel function coefficients under certain 
conditions and give the necessary and sufficient conditions for sparsistency. An efficient 
coordinate descent/backfitting algorithm is developed. A resampling procedure for our 
variable selection methodology is also proposed to improve power. 

Keywords: Automatic Relevant Determinant; Kernel machine; LASSO; Multivariate 
smoothing function; Nonnegative garrote; Sparsistency; Variable selection. 
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1 Introduction 



The variable selection problem is important in many research areas such as genomics, data 
mining, image analysis, text and speech analysis, and other areas with high dimensional 
data. In general, the input variables form an interacting network with one another and 
modeling these interactions is complicated due to high order interaction terms. 

There are numerous approaches to modeling high dimensional data using mult i- dimensional 
nonparametric models (Wahba, 1990; Green and Silverman, 1994; Hastie and Tibshirani, 
1990). Most variable selection approaches in multi-dimensional nonparametric models are 
performed in terms of function components selection, that is, modeling the function compo- 
nents (including nonlinear interactions) additively and then selecting significant components. 
Examples of these variable selection approaches are Component Selection and Smoothing Op- 
erator (COSSO) (Lin and Zhang, 2006), Sparse Additive Models (SAMs) (Ravikumar et al, 
2009), and the extension of SAMs, Variable Selection using Adaptive Nonlinear Interaction 
Structure in High dimensions (VANISH) (Radchenko and James, 2010). However, when 
the number of input variables is large and their interactions are complicated, modeling each 
interaction term is extremely expensive and these function components approaches may not 
be efficient. 

Other variable selection approaches based on the kernel machine method have achieved 
great success. Liu et al. (2007) established the connection between the least squares kernel 
machine (LSKM) and linear mixed models. Zou et al. (2010) employed a nonparametric 
regression model with a Gaussian process which simultaneously considers all possible inter- 
actions. In these works, the interactions among the multi-dimensional variables are modeled 
automatically by the kernel. Because of their simplicity and generality, function kernels and 
associated function spaces are a powerful technique to analyze multi-dimensional data. 

In this paper we will also focus on variable selection approaches based on the kernel ma- 
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chine method because the family of kernel functions is extremely rich for multiple regression 
smoothing. They have the flexibility for various models including additive functional ANOVA 
and nonadditive smoothing functions. For example, since any symmetric positive definite 
matrix is a valid Gram matrix, an additive Gram matrix K = X^O-^-i (0 s are nonnegative 
hyperparameters) can be used for the functional ANOVA / (x T ) = Y7j=i fj( x j)i where Kj 
is the Gram matrix for the jth function space fj and x T = (xi, ...,x p ). According to the the 
Representer Theorem (Kimeldorf and Wahba, 1971), a nonparametric function can be repre- 
sented using a kernel function, fj(x) = Y^n=i a ikj(xi,x) (the dual representation), where the 
c^'s are the kernel function coefficients. With penalty on the norm (or pseudonorm) of the 
jth function component fj, \\fj\\u K -i sparsity of the function components can be recovered 
(Lin and Zhang, 2006; Bach, 2008). 

However, with nonadditive smoothing functions, the kernel function k(x, x') is usually 
a nonlinear function of multivariate x, such as the Gaussian kernel function. In a model 
with such a kernel function, the response can no longer be expressed in terms of additive 
function components and no sparse function components are available. Therefore variable 
selection for recovering the sparsity of x within the nonadditive function becomes challenging. 
To address this issue, some Bayesian approaches of variable selection for Gaussian process 
models have been developed (Linkletter et al. 2006; Savitsky et al. 2011). To the best 
of our knowledge, no variable selection method based on the kernel machine have been 
established for nonadditive smoothing function models simultaneously recovering sparsity of 
input variables in a nonadditive smoothing function. 

Thus the goal of this paper is to study this model from the different view (Nonnegative 
Garrotte on Kernel) and to generalize it to include different Gaussian process kernels as a 
new variable selection approach on kernel machine, which is able to recover sparsity of input 
variables in a nonadditive smoothing function. 
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Our method is motivated by automatic relevance determination (ARD), which was 
originally formulated in the framework of neural networks in the context of Gaussian 
process (Neal, 1996; MacKay, 1994) considering kernel functions of the form k(x, x') = 
exp j— Y^j=i iji x j ~ x i) 2 }- We model the smoothing function with a general kernel func- 
tion with hyperparameters fy's. By shrinking these scale parameter £j's, we can select the 
variables. In this way, our approach can be applied to either additive or nonadditive models 
by choosing different K structure. 

For theoretical understanding of our approach, we develop the incoherence conditions. 
We show that under certain conditions sparsistency can be established. To recover sparsity 
of £j's, an efficient coordinate descent/backfitting algorithm has been developed to achieve 
the regularization path for £j's. 

The remainder of this paper is organized as follows. In Section [2j we first define the 
optimization function of our approach on the kernel model and discuss the connection of 
our approach with the linear nonnegative garrote model and also with the kernel machine 
learning problem. In Section [3] we propose our coordinate descent updating algorithm for the 
solution path of the scaling parameters. In Section [4] we discuss the necessary and sufficient 
conditions for consistency and sparsistency. We also show the asymptotic properties of our 
method with the consistent initial kernel function coefficients. In Section |5j we present several 
simulation examples. In Section[6j we apply our method to two real datasets: a cryptography 
dataset and a genetic pathway dataset. Section [7] contains concluding remarks. 
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2 Flexible Multivariate Nonparametric Modeling 

2.1 Multivariate Nonparametric Model Using Kernel Machine 

Consider an n-observation and p-predictor dataset (y, X), where X = [x 1; x 2 , x p ] and 
Xj = (xji, Xj n ) T is an n x 1 vector for the jth predictor, j = l,...p. In other words, 
X = [xi,x 2 , x n ] T where xj is a 1 x p vector of predictors of ith observation, i — 1, ...n. 

According to the Representer Theorem, the nonparametric multiple regression model can 
be expressed as (Kimeldorf and Wahba, 1971) 

y = f(X) + e = Ka + e, (1) 

where e ~ N(0, a 2 I) and K is the kernel matrix corresponding to the function space Hk, 
f G H-Ki also known as a "Gram matrix" of the kernel function k(x, x'). Thus the nonlinear 
function f(x) can be expressed as YH=i a i^{ x ii x )i where a = (ot\, ...,a n ) T is an n x 1 vector 
of the coefficients of the kernel function. Note that in model ([T]), y is centered, i.e. Yl Vi = 0- 
We also standardize X such that Y^i=i x ji = an( i S™=i x % = ^->3 = ■••■>$• 

To estimate ct in ([T]) , least squares kernel machine estimation minimizes the least squares 
error with penalized norm ||f \\^_ = ol t Kcx induced by the kernel of the function space Hk, 

±\\ y -Ka\\ 2 + ^Xoa T Ka, (2) 

and the solution is 

a = (X I + K)- 1 y, (3) 

where Ao > is a smoothing parameter which balances the tradeoff between goodness of fit 
and smoothing the curve or high dimensional surface. 
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2.2 Nonnegative Garrotte on Kernel (NGK) 

The Gram matrix can be viewed as applying a componentwise function on the similarity 
matrix among observations. The similarity metric between x and x' can be either the 
negative squared Euclidean distance, —\\x — x'\\ 2 , or the angle (dot product), x T x' . Both 
of the similarity metrics can be written in an additive form in terms of p predictors, i.e. 
— \\x — x'\\ 2 = — {(xi — x[) 2 + ... + (xp — x' p ) 2 } and x T x' = X\x' x + ... + x p x' p . By this 
additivity, the kernel matrix can be expressed as either a linear or nonlinear function of the 
additive form. For example, the Gram matrix by the dot product x T x' among observations 
is the linear polynomial kernel 

v v 
K(X) = pXX T = J>x,xJ = J2pD j , 

3=1 3=1 

where = Xj-xJ, with (k, /)th entry d J kl = XjkXji, 1 < k,l < n, and p is a scale parameter. 
Unlike a linear kernel, the Gaussian kernel can be expressed in a nonlinear function form 
because the Gram matrix with entries produced by the exponential function of — \\x — x'\\ 2 
is 

K(X)=ex V (^J2 D ^> 
where D J is the matrix with (/c, /)th entry d J kl = —{x^ — Xji) 2 and the (k, Z)th entry of matrix 

Y7 j= i DJ is -ll^fe - x i\\ 2 = - J2 p j= i( x 3k - Xji) 2 - 

More generally, let us consider a nonnegative scale parameter £ = (£i> with £j 

corresponding to each predictor Xj. Then both kernels can be expressed as 

m,x)= 9(^2^, (4) 

where ^ > 0, j = 1, ...,p, and function g(-) is a componentwise function of matrix entries. 
That is, for a linear polynomial kernel, all = p and g(-) is the identity function, and for 
a Gaussian kernel, all £j = p and g(-) = exp(-). Note that we do not need extra constraints 
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on the £j's such as = 1- This is because for a Gaussian kernel, exp(-) already places 
constraints on the £j's and, for a linear polynomial, the solution paths of the £/s with and 
without constraints differ only by a scalar factor and their sparsity properties remain the 
same. Thus for computational convenience we do not apply constraints on £ for either 
kernels. 

By introducing such nonnegative parameters to the kernel matrix, we can develop a 
variable selection approach for the nonparametric regression model ([!]) similar to the linear 
nonnegative garrotte method (Breiman, 1995). That is, we apply an extra penalty on £ such 
that optimization problem ^ is subject to £j > and — c ( c * s a positive real number), 
which results in the optimal problem 



where A > is a tuning parameter. We can refer to this method as "nonnegative garrote on 
kernel machine". 

2.3 Connection with Linear Nonnegative Garotte Estimator 

Introduced by Breiman (1995) for variable selection with the linear model y = f +e = Xf3+e, 
the linear nonnegative garotte estimator for the shrinking factor £ = (£i,-..,£ P ) T is the 
solution that minimizes 



where [3° is the initial estimate of /3j using ordinary least squares. For an orthonormal 
design X T X = I (i.e. x?Xj = 5^ such that = OWi ^ j and 8y = lVz = j), the 
nonnegative garrote solution for /3j is 






2 



subject to & > 0, Vj, 



(6) 



/ 



\ 



k = tjP? LS and ij 



1 



2 



V 
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where subscript "+" indicates the positive part of the expression. We can see that (|6j) is a 
special case of (J5]) with a linear polynomial kernel. To see this, consider ^ without penalty 
on ot. Thus Ao = and kernel matrix K = ^2, O x i x J- Then the least squares kernel machine 
solution is related to the OLS solution by choosing the initial a = y and we obtain the initial 
estimate for the response 

where fj represents the initial marginal response and the OLS estimate of the linear model 
is (3° LS = eJ(X T X)~ 1 X T y = xjy since X T X = I. ej here is the selection vector with 1 in 
the jth position. 

Yuan and Lin (2007) proposed a general linear nonnegative garrote approach 

2 lly - z £\\ + raA z^^'' sub J ect to ^ °' V ^'' ( 7 ) 

where Z is the matrix of columns of initial marginal response estimates. They proved that 
if the initial estimate is consistent, then the nonnegative garrote estimate is also consistent. 
Based on this idea, Yuan (2007) applied the nonnegative garrote component selection method 
to functional ANOVA models using the initial estimates of the function components fj as 
columns of Z (fj can be an interaction component). 

In Section |4| we will further derive similar approximated linear form of objective function 
(5) and show that Z is a matrix with columns (jjjfj = D^cx,j = 1, ...,p. In this case, it 
requires a local linear approximation of the kernel, say K(£) ~ K(g*) +X}j=i(£j — 
For a general kernel function, ck can be understood as the slope of the change of initial 

/ along £j direction given initial a. However, for nonlinear kernel functions such as the 
Gaussian kernel, we can not derive the algorithm with the linear form of ([7]). 
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2.4 Connection with Kernel Machine Learning 

First define function Qo(-) as 

Q (K(t X), a) = i ||y - JT(£, X)a|| 2 + ^a T X(£, A>. (8) 

The following lemma shows that Qo can be expressed in a form of function of K. 
Lemma 1: If K is in the set of all kernels on input set X, and if for a set of distinct points 
X G X K(X) is positive definite, then 

Q (K) = ^y T (\ I + Ky 1 y, (9) 

and Qq(K) is a non-increasing convex function of K. 

The form of Q can be easily derived since the solution for the least squares kernel 
machine problem is a. = (X I + K)~ 1 y. Plug this solution back into ^ and by simple 
algebra we obtain ([9]). A formal proof of Lemma 1 and convexity of Qq(K) can be found 
in Micchelli and Pontil (2005) and Lanckriet et al. (2004). Then the solution £ of the 
optimization problem ^ is considered as 



mm Q Q (K) = ) fy T (X I + K)- 1 y, 
subject to K e K* = jif^X) : £ £ IR+ and < c,j = l,...,p 

where 1R+ is set of p dimensional nonnegative real numbers. 



(10) 



The objective function (10) implies that we have a kernel based learning problem on 
K*, a subset of all kernels on input set X. More generally, the problem associated with the 
function Q (K) and the kernel K is the variation problem (Micchelli and Pontil, 2005) 

Qo(K) = inf{Q {K) : K e K}, (11) 

where IK is a convex set of all positive semidefmite kernel functions. Thus our problem 



can be viewed as a special case of (11), as learning the kernel function via regularization 
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^||f||^ = ?fa T Ka, subject to K £ K*, where K* C K. If optimizing Q on K, this is the 
problem of learning the kernel discussed by Micchelli and Pontil (2005). 

Although our problem is learning the kernel K e K*, it is different from the problem of 
learning the kernel function discussed by Micchelli and Pontil (2005), Lanckriet et al. (2004), 
and Rakotomamonjy et al. (2008). This is because the set IK* C K is usually not convex 
and the optimization problem is learning the kernel through a nonlinear function g(-) on 

We can express ^ as a function of 

Q(Z) = Qo(t) + A J2 & = Y yT (A ° J + Km ~' y + A E & (12) 

3=1 3=1 

The convexity of Qo(£) is interesting since it determines the convexity of Q(£) and convex 
objective functions have many convenient properties in the optimization problem. Unfortu- 
nately, however, it is not straightforward to determine the convexity of Qq(£) as its convexity 
completely depends on the kernel function K(£) and X. The following Lemma shows a suf- 
ficient condition for Q(£) to be a convex function of 

Lemma 2: If matrix set K(£) = giY^^iCj^) ^ s concave on i.e. K(9£ + (1 — b 



0K{^) + (1 — 9)K{$ s ') where < 9 < 1, then the regularization problem (12) is convex on 

This can be easily shown by the composition theorem (Boyd and Vandenberghe, 2004). 
That is, for a function f(x) = h{g(x)}, f is convex if h is convex and non-increasing and 



g is concave (see Appendix A.l). An obvious example of a concave kernel K is the linear 
polynomial kernel (which is convex too). So we conclude that the objective function Qo(£) is 
a convex function of £ for a linear polynomial kernel and = Qo(£) + A X] Cj is a strictly 
convex function of Thus the regularization problem p2[ ) has many of the nice properties 
of convex optimization. In particular Q(£) is strictly convex, the solution £ is unique. 

However, in many cases, it is not straightforward to derive the concavity or convexity 
of K(£). For instance, with the Gaussian kernel, since K(£) is neither concave nor convex 
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on £, it is difficult to determine the convexity of Q($). The most ideal scenario is, Q(£) = 
Qo(£) + is quasicovex (unimodal) so that Q($,) has a unique minimum. In this case, 

the Hessian matrix of Q{$,) may not be positive (semi-) definite everywhere. However, one 
can expect a positive (semi-)definite Hessian matrix in the neighborhood of a minimum, that 

is, to. 

In practice, we start with an initial £ close to zero when solving the optimization problem. 
Hence we can assume the initial values and the solution path are always in the neighborhood 
of a minimum where H >z 0. This is a reasonable assumption with Gaussian kernels since in 
the sparsity problem most £ 3 -'s are zero and, in general, non-zero £j's are all small positive 
numbers. When the £j's are small numbers or zeros, Gaussian kernel can be well approxi- 
mated to the linear order of the Taylor expansion as a concave matrix of In the following 
section we provide the regularity conditions which are usually satisfied in least squares error 
optimization. 

2.5 Some Notation and Regularity Conditions 

We first define some notation. Let and £ represnt the true £ and minimum solution of 
d5J), respectively. Suppose vector is sparse, i.e. some £* = 0. Without loss of generality, 
denote = (£*, £*) T = ^£* T , £o T j j where £1 is the vector of the first a nonzero £*'s and 
£q is the zero vector. Define the nonzero index set of £* as A := {j G {1 , . . . , p}\^* > 0}, and 
denote A := {j G {1, > 0} as the nonzero index set of Note that A has relatively 

small cardinality a = \A\, the number of true nonzero £j's. 

Let the least squares error estimate of a be a = A _1 (^)y, where A(£) = A / + K(£), 
and denote the true ex. vector as ex* . For a given a* and estimate ex we define the following 
matrices and their respective partitions: 

Z=[z 1 ,...,z p ] = [ Zl ,Z ] = [{tyna^^WieW}^^] (13) 
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Z = [z 1; ...,z p ] = [Z^Zo] = UjCffia} ,{!%{?)&} 



a+l<j<p 



(14) 



where Kj(£*^ 



dK 



obtained by taking the partial derivative of the componentwise 



entries of K. Note that K'-ot is an n x 1 vector and the Z's and Z's are n x p matrices. Some 



covariance matrices are also defined as 



En = (n 1 Zf Zi) , Soi 
En = (n^ZfZ^, E 



01 



(15) 



where E n and E n are assumed to be invertible. 

We then further define the p x 1 vector v n (£) and p x p matrix M n (£) as follows: 

T 



1 



-i 



y^A" 1 ( ^ ) A-V 



i<i<p 



M n {£) = (A n) 



.i9 2 Qo(^) 



1 

2n 



Ta-1 



y^A 



OK ^ ,dK dK x ,dK d 2 K 
— A" 1 A -1 



A-V 



l<i,i<P 



These are analogous to the negative score and Hessian matrix of a log likelihood function 
— Qo(0, respectively. To see this and the regularity conditions, we first consider the log 
likelihood function of our model. Then, up to an additive constant, the negative log likelihood 
function of £ is 

2^ ||y - K(& X) a \\ 2 + ^<* T K(t X)a - \ log \K(t X) | + n\ £ 
Letting Ao = ^ and a 2 \ = A, the above expression is equivalent to 

i ||y - K(£, X)af + ^a T K(£, X)oc - £ log \K{t X) | + nX £ 



(16) 



Expression (16) only differs from ^ with respect to a log \K\ term. Thus, strictly speaking, 
estimating £ by ^ no longer provides the MLE estimate. We omit the log \ K\ term since in 
the NGK model, values of the £j's are usually sparse and small. Hence, in the region where 
we estimate and the determinant of K is almost constant of £ for both the Gaussian 
kernel and linear polynomial kernel. In this sense, ^ and (16) are equivalent. However, 
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our algorithm benefits greatly from omitting log \K\ since derivatives of this term result in 



complicated expressions. Therefore we assume the minimum of (16) is not greatly affected by 
log | if | and we can still consider our objective functions Q(£) and Qo(£) as the log likelihood 
function of with and without a prior for £, respectively. 

Based on these arguments, considering the convexity and differentiability of Qo(£), we 
can assume that the regularity conditions of the log likelihood also apply to Qq(£) such that 

limv n (£*) -)> v*, and ||v* < oo, 

(17) 

lim M n {C) -> M*, and HM*^ < oo. 



Particularly 



1 



lim v n (£*) -)• - lim <j ^a* T i^.(£*)a* } = v* 



2x/n 



i<j<P 

These regularity conditions indicate that v n (£) = O p (l) at $,* and M n (£) is finite and positive 
(semi-) definite at £*. These conditions are consistent with the convexity assumption of Qo(£) 
discussed in the previous section. 

3 Methodology 

In this section, we provide an efficient algorithm to solve the objective function ^ with a 
given initial ck. 

3.1 Backfitting Algorithm to Update £ 

An efficient algorithm to achieve the regularization path of (JsJ) for (a, £) is still an open 
problem. One possible approach is to iteratively update between a. and £ until convergence 
from an initial (6t-°\$^ ^). This updating approach, however, could be very expensive and 
may not be able to converge. Another possible approach is the one-step update algorithm 
proposed by Lin and Zhang (2006) for COSSO. That is, at each fixed A, solve £ with given 
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ex, then update ex = A _1 y with the new £ and continue to the next step. However, the 
one-step update algorithm may not be necessary to solve the solution path of £ as long as 
we have some consistent initial estimate of ex and keep it fixed through the entire solution 
path. In Section [4] we will show theoretically that, as long as ex is consistent, the sparsity of 
£ can be recovered as n increases. Although the initial en-fixed algorithm may not results in 
a consistent estimation of we will also show under certain conditions that the estimation 
consistency of £ can be achieved. 

With fixed initial consistent ex, the algorithm to update £ becomes efficient. The algo- 
rithm we propose in the following can be viewed as the non-linear version of the coordinate 
descent algorithm for nonnegative garrotes. In special cases where linear polynomial kernels 
or other additive multiple kernels are considered, our algorithm is equivalent to the least 
angle regression selection (LARS) algorithm. 

The steps of our algorithm are summarized as follows: 

• Step 1 Initialize ex = (X I + K(p)) _1 y and A = cr 2 ja 2 a by setting all ^ = p and fitting the 

least squares kernel machine by MLE/REML methods. 

• Step 2 Determine the initial A for which all £j = 

A (0) = max jn" 1 (y - K(0)af (i^(O)a) 
where y = y — 4f ex. 

• Step 3 Update £ coordinate wise at A^" 1 " 1 ^ with given ex by the following equation until 

converge: 



(y - Kcx) T K)cx - n\( k+ V 



(19) 



(K' 3 exf (Kfi) 

where £j denotes the previously updated and K and K', are calculated from 
previously updated £/s. 
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Step 4 Decrease A and repeat step [3j 

Step 5 Stop when model selection criterion reaches minimum or last A = 0. 



In Step 3, we derive the updating equation (19) for £ via an approximation of K(£). 
At given A, assuming the current iteration £ is close to the minimum solution £, the kernel 
matrix can be extended in one coordinate direction around 



where (—j) denotes exclusion of £j. In simple notation 

&) + A}, (20) 



where = K{£) and Aj = j As an example: for a Gaussian kernel K[- = Ko and 
for a linear polynomial kernel K 1 - = D\ where "o" denotes the Schur product or entrywise 
product of two matrices. 

The updating solution of given \ is achieved by plugging (20) into (Jsj) and solving 



£j = argmin (j5l) given ck and Ao- We notice that expression (19) is similar to the backfitting 
algorithm (Ravikumar et al., 2009; Hastie and Tibshirani, 1990) in nonparametric additive 
models, except our algorithm is a version of backfitting on nonadditive models by considering 
the £j updating step as 

• Step a Initialize £j = £!j k \j = 1, ■■■■,p with a given. 

• Step b (1) Compute the residual, r j = y — Ka + ^K'-ol. 

(2) Project the residual onto Zj = K'^dt and Pj = zjr^. 

Pi-nX 



(3) Update — i M - lfJ 



Kir 

Step c Repeat [b] until the individual £ 3 -'s do not change. 
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3.2 Advantages of our Algorithm 

When we use a linear polynomial kernel, K — ^ £jXj-xJ\ we revert back to the additive kernel 
case, K = J2£jKj, which has been thoroughly discussed by Bach (2008) and Rakotoma- 
monjy et al. (2008) for the multiple kernel learning (MKL) with LASSO. For the additive 
kernel case, this is closely related to functional ANOVA models. Yuan and Lin (2007) pro- 
posed nonnegative garrote component selection in these models, and introduced the LARS 
algorithm for solving the linear nonnegative garrote problems when p < n. When p > n, 
the LARS algorithm may not work well due to singularity of the active variable correlation 
matrix and reliance on generalized inverse matrices. On the other hand, our algorithm has 
two main advantages: first it works for p > n, second it works with nonadditive kernels. In 
addition, our algorithm is related to ARD, which is typically discussed in a Bayesian context 
(Neal, 1996; Krishnapuram et al., 2004; Zou et al., 2010). To the best of our knowledge, our 
algorithm is the only non-Bayesian approach for determining the hyper parameters in ARD 
with penalty on and has the added bonus of being more efficient. 

3.3 Model Selection 

Variable selection depends on how to select the penalty parameter. After determining the 
penalty parameter by some criteria, one can decide the final model which contains the most 
relevant variables to the response. However, as we discussed before, NGK variable selection 
is rather new topic within the kernel machine framework. There is no similar work available 
to provide a perfect criterion for selecting penalty parameter A. One can choose a optimal 
A at the minimum of the criterion and then obtain variables in a model at this optimal A. 
However in practice most of the popular criteria such as BIC, Cp and GCV may become 
flat. It becomes hard to determine an appropriate minimum. Another issue is that there is 
no perfect criterion in existence. The performance of any criterion not only depends on the 
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model, but also depends on the data structure. Hence in our study, we propose to select 
variables according to the selection probability or frequency of individual variables. This 
probability is achieved by some resampling procedures with variable selected by least squares 
kernel machine BIC for each single resampling. We propose two resampling procedures: one 
is based on bootstrapping for large sample size and the other is based on permutation for 



small sample size. Our resampling procedures are further described in Section 6A and 6.2 
respectively. 

The least squares kernel machine BIC is defined as 

df log(ra) 



BIC = fog(RSS) + 



n 



where RSS — (y — f) T (y — f). For given minimum solution the estimated function f can 
be expressed as i* = Sy, where 5* is the smoothing matrix. For the least squares error kernel 
machine, S = K(^) (^Xol + K(^)^ , the degrees of freedom of the kernel machine smoother 
5* is defined as df = Trace(S'). BIC was used by Liu et al. (2007) in the semiparametric 
mixed model with the least squares kernel machine. 



4 Some Theoretical Properties 

Consistency in variable selection problem includes two aspects: estimation consistency and 
model selection consistency. Between the two, one does not necessarily imply the another. 
The former requires £ — $,* — > as n — > oo, and the later requires lim n P(A = A) — > 1. The 
model consistency is also called sparsistency, shorthand for "sparsity pattern consistency" 
(Ravikumar et al., 2009). 

In this section, similar to consistency of LASSO, we first show that under certain condi- 
tions the NGK estimator is y/n consistent. Then we will further discuss conditions for which 
the NGK estimators are sparsistent for initial ex.. This is important because, in our NGK 
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algorithm, we assume the initial cx is fixed. 



4.1 Necessary and Sufficient Conditions for Consistency of £ 

We first establish the yfn consistency of £ estimation in Theorem 1 and then provide sufficient 
and necessary conditions in Lemma 3. 



Theorem 1: Under the regularity conditions (11), if y/n\ — > 0, then there exists a local 
minimum £ of Q(£) such that ||£ — £*|| = 0, p (n~ 1 / 2 ). 

The proof of Theorem 1 is similar to Fan and Li (2001) and Wang and Leng (2007), 
where both used the regularity conditions of the log likelihood function. We show the proof 
for Theorem 1 in Appendix 



A.2 



The \fn estimation consistency of £ guarantees that, when 



n is large enough, the minimum solution of (12) is consistent with £*. This Theorem means 



when n is sufficiently large, the kernel matrix K(£) is close to K(£* 
Defining the sign function of 



+1, if & > 
0, if = 0, 



the following lemma states the necessary and sufficient conditions for £ to be consistent. 
Lemma 3: Given initial a = a* , the necessary and sufficient conditions for £ to be consis- 
tent, i.e. lim n P(^4 = A) — 1 or lim n P{sgn(£) = sgn(£*)} = 1, are 
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(21a) 
(21b) 



Note that in the above expressions, 1 is a vector of l's (size different for (21a) and (21b)). 

To prove Lemma 3, we need some approximation form of ([5]). According to Theorem 
1, £ is y^-consistent, i.e. £ — > as n — > oo. The linear approximation of the kernel 
function holds: K{1) = K{?) + E- =1 (4' " $)#KO + O p (\\£ - Til 2 )- Given a = a*, 
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y — f = y — K($ > *)ct* = e. Plugging ct = ct* and approximated K(£) into the expression of 

Q(£) in (J5J), the regularization problem is approximated as 

x p p 

An 



3=1 



^(ii-C^K'itW + nX^ir (22) 



By using notation y = y - ~ t«* + Y$=i§ K j(£*) a * = e ~ f a * + Z €* and 



3=1 
p 



3=1 



V 



rearranging the above expression, we have the following equivalent expression 

p 



+ n\J2^3 



(23) 



3=1 



Expression (23) is similar in form to the linear nonnegative garrotte objective function ([7]) 
proposed by Yuan and Lin (2007) except for the modified response y has a non-linear term 



2 • 



Note that we use the above approximation of the kernel only for theoretical analysis 
purpose. For algorithm derivation, we only approximate the kernel in one £j direction. Thus 



we can not derive a form similar to (23) for a Gaussian kernel because y and the Z matrix 



are no longer fixed and updated by £ each iteration (See Section |3J). For a linear polynomial 
kernel, since the kernel is a linear combination of multiple kernels, no approximation is 
needed and we can derive the exact linear negative garrote form as above. 



When the minimum solution of £ is close to as Theorem 1 states, the solution of (23) 



is consistent with the solution of ([5]). Thus we can start from (23) to derive the incoherence 



conditions as (21a)-(21b) (see Appendix A. 3) 



4.2 Recovery of Sparsity 



Note in Lemma 3, conditions (21a)-(21b) are derived with the initial a = a* . However, in 



practice, we consider 5-consistent ex. and Z matrix. A question arises about whether or not 
we can similarly solve Kq and ^ based on a, 



Xk 



n 
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(24a) 
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(24b) 



such that we can use them to recover the sparsity of £*, where is the subgradient of ||£||i 



corresponding to those £j = (see the Appendix A. 3). To show these equations (24a)- (24b) 
are satisfied, we consider additional conditions required on cx for recovering sparsity for how 
fast it converges to cx* . 

The above argument shows that, if we have a consistent estimate of cx*, we can recover 



sparsity of £ using (24a)-(24b) so that we do not need to estimate cx and £ at the same time. 



Based on this idea our algorithm is developed. In our algorithm we use some cx as the initial 
value and keep it fixed for the entire solution path of £. 



Thus, motivated by consistency conditions (21a)-(21b), we consider the following zero 
noise incoherence conditions on the Z matrix: 



SoiS^l - 2^Z T Pa* r< (1-7)1 



(25a) 



where 7 G (0, 1], and P = [I — Z\{Z\Zx) 1 Zf] is a projection matrix. Expressions (21a) 



(21b) and (25a) are calculated based on the true cx*. We can show that as long as cx is 



(^-consistent with 8 — > 0, the similar condition, EoiSn 1 ! — yjZqPcx ^ (1 — 7)1 is satisfied 



(see Appendix A.4), where 7 G (0, 1]. Furthermore, we need the following assumptions for 
a based calculations. 

(£11) > C min > (25b) 



EoiSn 1 — ► Soi^n 1 with rate no slower than 6, 



1-1 



where A min (-) denotes the minimum nonnegative eigenvalue. 



(25c) 



There are two interesting features of (25a). First, unlike the incoherence conditions of linear 



LASSO where S n and S i are the correlation matrices of predictors, S n and S i in (25a) are 



the correlation matrices of the z^-'s, the vectors of the first derivative of initial f = Kcx* with 
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respect to the £/s. Second, besides En and £ i terms, (25a) contains an extra -^Z^ Pcx* 
term, which is related to the nonlinear component a* projected to the perpendicular space 
of the Zi matrix space. 

Theorem 2: Under the following conditions 

1. the initial estimate a. is 5 consistent, i.e. \a — ck*|oo = O p (5) for some S — > 0, and 



2. (25a) -(25c), 



there exits some X with n\ 2 — > oo such that for some constant 771 > ; with probability 
1 — exp(— f]in\ 2 ) -flue have results (a)-(b): 



(a) A C A and the upper bound of ||^ — £*||oo converges to 
p(A) = A 



+ HSnloo ■ ^ (n" 1/!, ||v*||oo + Op(5)) + Pn 1 ! 



ffi— • N"ll HOC A 
V u mm 

(b) If p(A) < minjg^^*, then we have sparsistency of i.e. A = A. 

This Theorem 2 generalizes Theorem 1 of Yuan and Lin (2007) on consistency of the 
linear nonnegative garrote for nonadditive models. Proof is similar to Wainwright (2009) 
and Ravikumar et al. (2009) which are based on the technique of a primal dual witness on 
model selection consistency. In Theorem 2, we use the assumption that Zf Z\ is invertible. 
Note that without this assumption, the solutions to $, l and kq are not unique. 



In Theorem 2, A is required to be greater than y • C, where C is some constant 
determined by 5, a 2 , and 7, so that exp(— rjinX 2 ) — > as nX 2 — > 00 and (a) is satisfied. 
This places some limitation on A such that it can not be artificially small. Nevertheless, 



according to (a) in Theorem 2, if we further have A + X^ya/n + O p (Xo^fa5) + ^faX — > 0, 
then ||£ — £*|| — > implying we can have estimation consistency as well. 



22 



5 Simulation Results 

5.1 Comparison with Linear LASSO 

In many cases, even though the underlying true model is nonlinear, variable selection using 
linear LASSO can be easily used since algorithms for linear LASSO are already available (e.g. 
LARS). These algorithms might work well as long as the following incoherence condition is 
satisfied, 

\XTX 1 (X?X 1 )- 1 s ga (P* 1 )\ d 1, (26) 

where Xo and Xi are the matrices of irrelevant and relevant predictors, and (3\ represents 
the vector of true nonzero /3j's. 

In this section we show a special case that using the NGK method sparsity of input 



variables can be recovered, while linear LASSO fails due to unsatisfied condition (26). 

We use the same 3-variable setting by Zhao and Yu (2006) where they used simulation 
to demonstrate the incoherence condition in linear LASSO. First we generate iid random 
variables xi, X2, e and e from X(0, 1) with sample size n = 100. The third predictor X3 is 
generated by 

x 3 = axi + 6x 2 + ce, 
where a = 2/3, b = 2/3 and c = 1/3, and the response is generated by 

y = /3*xi + /^xa + e, 

where (3* = 2 and $\ = 3. Denote X\ = [xi,X2] and Xq = [X3]. Zhao and Yu (2006) showed 



that with this setting, (^-X^Xi) QXfXi) 1 = (|, |), thus the incoherence condition (26) 
for linear LASSO is never satisfied with sgn(/3^) = sgn^). 



However the incoherence condition (25a) of NGK provides a different incoherence con- 



dition that is satisfied. To demonstrate this, we consider using a linear polynomial kernel. 
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Thus with £* = (^,l,Q) T an d £3 = 0, we have K($*) = ^xixf + ^ X 2 X 2 • Using the notation 



in ( 13 )-( 15 ), we obtain 
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In equations (]27|)-(|28|), we use the fact that for independent random normals, -xfx 



i ^3 



= 1,2 and ^x^Xj = a or 6 for j = 1 or 2. Given ck = (Ao/ + K(£))~ l y with 
£ = (1, 1, 1) T , we can calculate the left hand side of (25a). For demonstration with one 
simulation example, we calculate two incoherence condition curves vs A and Ao, respectively. 
For the first curve vs. A, we fix Ao = 0.0026 estimated by REML. For the second curve, we 
fix A = 1.516, where we choose the model with minimum BIC and vary Ao- 

Figure [TJa)-(b) show two plots: one (a) is for the incoherence condition values vs. A and 
the other (b) is for the incoherence condition values vs. Aq. They show that for certain A 



and Ao values, the incoherence condition values are smaller than one, thus condition (25a) is 
satisfied and there is a possibility that the variable selection procedure of NGK can recover 
sparsity of those irrelevant variables. 

Figure [TJc)-(d) also show plot of the regularization path for linear LASSO (c) and NGK 
(d). It can be seen that for linear LASSO, 03 is always non-zero on the path except when 
A = 0, which means linear LASSO will always select (3%. However, the regularization path of 
NGK shows that for some A, £3 = 0, but both £1 and £2 are greater than zero, providing the 
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possibility to select the correct variable set. The dashed line in Figure [jjd) indicates where 
we based model selection on minimum BIC. 



5.2 Simulation Example 1 

In this section we test the implementation of NGK on a nonlinear multiple regression simu- 
lation scenario. For this scenario, we consider the simple situation where the total number 
of predictors p is 11 and the first a = 5 predictors are relevant. Three settings with sample 
sizes n = 64, 128 and 256 are compared. For each setting, a total of 200 runs were generated. 
The nonlinear function f was generated using a stationary zero mean Gaussian process 

where we use Gaussian kernel K(£*,X) with X = [xi, ...,x p ], with each column generated 
independently by x, ~ Z7(— 2.5,2.5) , and £* = ... = = 2. The responses were then 
produced by 

y = f + e, (29) 
where e ~ N(0, a 2 I). In this scenario, we chose o 2 a = 10 and a 2 = 1 (A = 1/10) to produce 



our datasets. Note that model (29) is equivalent to y = Kot + e with ot ~ iV (0, er^iT -1 ). 

We also note that, in this example, / is not a fixed function anymore, and ex is random. 
This should lead to different incoherence conditions and, because of the randomness of a, 
the probability of recovering sparsity is expected to be lower. However, we only use this 
example to demonstrate the performance of NGK with a similar variable selection method 
like COSSO. 

The solution paths and BIC curves of one simulation run for NGK with a Gaussian or 
linear polynomial kernel are shown in Figure 1 of Supplementary Materials. The frequency of 
variables selected in the model for 200 runs are also summarized in Table 1 of Supplementary 
Materials. 
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Five statistics are reported in Table [TJ They are "False Positive Rate (FP-rate)", 
"False Negative Rate (FN-rate)", "Model Size (MS)", "Residual Sum of Squares (RSS)" 
aud "Squared Error (SE)", where FP-rate = #Fal8eP ^£££%; NegaHve , 
FN-rate = #False *^^Z P osu^ ^S = - /*)7« and SE = ~ f,) 2 /n are 

calculated for each individual run. The average and standard deviation of these statistics 
from 200 runs are reported, f's are calculated using least squared error estimation of the 
kernel machine with corresponding £, i.e. f = K(£)A~ 1 (£)y. SE can be used to assess the 
accuracy of estimation of the nonlinear function /. Note that in Table [T] and the following 
tables and plots, "NGK Gauss" and "NGK Poly" represent NGK method with Gaussian 
kernel and with linear polynomial kernel, respectively. 

The performance of COSSO and NGK methods are similar. COSSO is slightly better in 
terms of the FP rate and FN rate. However, we consider the linear polynomial kernel NGK 
method as the best method in this example, not only because it has similar FP and FN 
rates as COSSO but also because it shows the best accuracy in estimating /. In addition, 
the Gaussian kernel NGK method also has higher accuracy in estimation than COSSO. As 
expected, we can see all methods have comparable high FN rates since the function / is not 
fixed. 

5.3 Simulation Example 2 

In this example, we consider fixed / and generate response y using 

y = f + e = 10 cos(xi) + 3x1 + 5 sin(x 3 ) + 6 exp(x4/3)x 4 + 8 cos(x 5 ) + x 5 x 2 Xi + e, 

where e ~ N(0, 1) and Xj ~ £7(0, 1), j = 1, ...,p. Function / in this simulation is similar to 
the one used in Liu et al. (2007). In this example, we consider o = 10 total predictors where 
the first a = 5 are relevant. Again, three settings with sample sizes n = 64, 128, and 256 
were generated with a total of 200 runs per setting. 
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A selected example of the solution paths for two NGK methods and the BIC curves are 
shown in Figure 2 of Supplementary Materials. The selection frequency of 200 runs are listed 
in Table 2 of Supplementary Materials. Five statistics of 200 runs are summarized in Table 
[2| It can be seen that all three methods have the same zero FN rate. When the sample 
size is n = 64, the COSSO approach seems to perform the best in terms of the FP rate, but 
still has the worst estimation accuracy. When sample size increases, NGK methods quickly 
catch up in terms of FP rate. When n = 256, the three methods have nearly the same FP 
rate. It can be seen that with increasing sample size, COSSO maintains almost the same 
estimation accuracy, while NGK methods seem to estimate more accurately. In this example, 
the Gaussian kernel NGK method is considered to be the best method not only because it 
performs as well as the other methods in terms of FN and FP rates, but also because it has 
the best estimation accuracy. 

According to Example 1 and Example 2, we observe that although the COSSO method 
is based on only an additive model (or at most second order interactions), it is capable of 
variable selection for models with higher order interactions. However, when higher order 
interactions are included in the true model, additive type methods may not perform as well 
as the kernel based methods in terms of estimation accuracy. In Example 1, the interactions 
of the model might be any order since we use a Gaussian process to generate the data, and 
in Example 2, the true model contains third order interactions, but the COSSO procedure 
we apply only models the additive main effects. In contrast to this, instead of modeling 
each interaction component, the Gaussian kernel NGK method can model interactions of 
any order, as well as select the input variables correctly. 
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5.4 Simulation Example 3 

As mentioned before, when p > n, COSSO and LARS on nonnegative garrotes both fail. In 
this example, we consider a special case with p = 80 and n = 64. So far there is no other 
approach capable of modeling the nonadditive model and selecting predictors for p > n 
cases. Hence we only compare the Gaussian and linear polynomial kernel NGK methods 
using our backfitting algorithm. Example 3 has the same true function as Example 2. The 
first five predictors are relevant and a total of 400 runs have been simulated. Since computing 
becomes intensive when p is large, we only demonstrate the results with n = 64. 

Figure [2] shows example solution paths for Example 3 by the Gauss and linear polynomial 
kernel NGK methods. In both cases the number of variables selected by BIC is greater than 5. 
Other criteria such as Cp and CV in this simulation also select larger model sizes. Therefore, 
in Example 3, variable selection according to a single run is not sufficient for revealing the 
correct model. 

Because of the number of predictors, instead of a table, we portray the selection frequency 
or probability of each variable for 400 runs in Figure |3j There is little difference between 
the two NGK methods in terms of the selection probability. This can be seen from Figure [3] 
where the first five variables have selection probability very close to 1.0 for both methods. 
In addition both methods show the same behavior in that the first five variables are clearly 
separated from the remaining 75 variables in terms of selection probability. However, the 
linear polynomial kernel method has a slightly higher FN rate than the Gaussian kernel 
approach since these five probabilities are slightly higher for the Gaussian kernel method. 

From Table [3] we can see the FP and FN rates for Example 3. Compared to Example 
2, the FP rate of Example 3 for the Gaussian kernel approach is comparable, 0.09 and 
0.08, respectively. For the linear polynomial kernel method, the FP rate increases slightly. 
The major difference is that in Example 2 FN-rates are zero for both methods, but are 
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nonzero in Example 3. This is reasonable since inclusion of many irrelevant predictors 
deteriorates variable selection performance. One additional observation from example 3 
is that the standard deviation of the five statistics across 400 runs is much larger for the 
polynomial kernel method while the average is similar between both methods. This is because 
we base selection on BIC. While the BIC curve for the Gaussian kernel method has a clear 
minimum, the BIC curve for the linear polynomial kernel method drops from £j = and 
becomes flat. We select the model at the turning point of the BIC curve which may introduce 
some variability for different runs. We also realized the the average model size is greater 
than 5 from Table |3j which reflects the fact that the including more irrelevant predictors will 
result in more irrelevant predictors being selected. 

The above simulation results suggest that if we choose the model according to BIC or any 
other criteria based on only one run, we may select more irrelevant predictors. However, if 
we select the model based on the frequency or probability of selected predictors, as in Figure 
[3j it is clearer that the five true variables behave differently from the others. This provides 
new ideas regarding variable selection: it is less powerful to select the correct model with one 
single set of data than with multiple drawing of the data. Furthermore, if we use multiple 
drawing or resampling of the observations, we can estimate the selection probability of each 
variable which provides more power to select the correct variables. In the following section 
we apply this idea to two real data sets. 

6 Applications 

In this section, we describe the application of our method in two practical settings. 
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6.1 Key Selection For Cryptography Data 

Our first example is taken from a cryptography study. Side-channel analysis (SCA) is a tech- 
nique of cryptanalysis with which an attacker estimates the secret key based on information 
gained from the physical implementation of a cryptographic algorithm. Figure 3(a) in Sup- 
plementary Materials gives the diagram of attacking on an Advanced Encryption Standard 
(AES) system. The outside box represents an electronic circuit system that implements 
the AES algorithm. The AES algorithm processes the input data "zrifc" and produces an 
encrypted output u outk" using a secret key with bytes 9k, k = 0, 15. The 16 S boxes each 
takes an 8-bit byte 9k- The attacker's objective is to determine the value of the secret key 
9kS. The output of the AES algorithm is captured in each of the 16 encryption rounds, 
and the corresponding power consumption of all rounds is recorded as y. By observing the 
output, one can therefore infer 16 estimates X k = [x ljfc ,x 2) fc, • • • , x 256;fe ], k — 0, 15, corre- 
sponding to 16 secret key bytes, and there are 2 8 = 256 possibilities for each estimate with 
only one as the true. The SCA proceeds by observing n encryptions. The data structure can 
be expressed in a matrix-form as shown in Figure 3(b) of Supplementary Materials, where y 
is an n x 1 vector, and each Xk is an n x 256 matrix. Note that each Xk,j,j = 1, ...,256, is a 
function of the output outk and the jth key guess 9k,j on the fcth S box. The SCA problem 
is to find the set of columns that represents the true 9k's. The index of the selected column 
in each Xk returns the value of secret key byte 9k- 

Define X = [X , . . . ,X± 5 ] to be an n x (r x k) matrix, reflecting internal estimates for 
a SCA with n measurements, k key parts of r guesses per part. In the SCA example, 
n = 5120, r = 256 and k = 16. The objective is to identify which possible key guesses (or 
what combination of the columns of X) are highly associated with the power consumption 
trace y. Since there are k key bytes and r possible key guesses for each key byte, there are 
a total of (^ rx k k ^) possible ways to select k variables. The power consumption trace y can 
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be expressed in terms of the key estimates X by model ([T]). It is reasonable to assume that 
there are no interactions among the key guesses, so that we can use a linear polynomial 
kernel NGK method for these data. 

The data set (y,X) contains 5120 observations and a total of p = 16 x 256 = 4096 
predictors. Identifying the 16 key bytes is a variable selection problem. Due to the high 
dimensionality of the X space, directly applying our NGK approach is less efficient. Fan 
and Lv (2008) discussed sure independence screening (SIS) for ultrahigh dimensional feature 
spaces, and Fan et al. (2011) extended correlation learning in linear models to nonparametric 
independence screening (NIS) in additive models. They argue that under certain conditions, 
the probability of the screened model including all the relevant predictors, approaches one 
as n increases. We adopt a similar procedure, NIS-NGK, that is, we apply NIS and then 
perform the NGK variable selection approach using a polynomial kernel. 

Another issue for this dataset is the observation size. With n = 5120, Computation be- 
comes expensive due to calculation of the kernel K, especially when using a Gaussian kernel. 
We take a resampling approach with observation size m = 2048 to reduce the computa- 
tion burden. However, it turns out that variable selection on large datasets through multiple 
resampling is more powerful in identifying the significant predictors than selecting the predic- 
tors based on a single run. There is not much work discussing the resampling/bootstrapping 
procedure in variable selection. Hall et al. (2009) proposed a m-out-of-n bootstrap on linear 
LASSO and provided theoretical justification of their resampling approach. We extended 
this m-out-of-n bootstrap approach to our NGK method. 

The screening approach we applied is meant to rank the predictors according to the 
descent order of the residual sum of squares by a marginal nonparametric regression. 

S = {l<j<p:r j <C}, (30) 

where rj = min^. Q ||y — ^jD^a.\\ 2 and C is a predefined threshold value depending on n. 

31 



To reduce computational time, we take a = y and all £j = 1. Then the NIS screening is 
equivalent to SIS which ranks the predictors by the correlation x^y 7 . This can be seen by 
plugging a = y and £j = 1 into rj, rj = y 2 — (xjy T ) 2 . Using this approach we first screen 
the predictor size down to 200. According to the Theorem 1 in Fan et al. (2011), with n 
very large and finite predictor size, the probability of the screened predictors including the 
true predictors is close to one. 

Following the NIS step, we apply the NGK variable selection approach to one resampled 
dataset from the original 5120 observations. Figure 4 in Supplementary Materials is an 
example of the m-out-of-n resampling from the n = 5120 observations. The solution path 
in Figure 4(a) in Supplementary Materials shows that there are 16 predictors (bold lines) 
which behaves differently from the others. Because we have information about the AES key, 
we know there are 16 bytes corresponding to 16 true predictors. By checking the AES key, 
these 16 predictors are the exact 16 key bytes. Additionally, the BIC curve shows that the 16 
predictors are clearly separated from the rest of the points on the curve (see Figure 4(b) in 
Supplementary Materials ). Other critera such as Cp and GCV, all show the same behavior. 
It is obvious that these 16 predictors should be selected for the true model. However, if 
we use minimum BIC as our model selection criterion, we will have a total of 38 predictors 
selected in this run. Although these 38 predictors include the 16 true predictors, we are over 
selecting. This is true even when we sample more observations or use all 5120 observations. 

Thus we further resample the dataset up to 1200 runs with replacement with each run 
choosing variables according to the BIC criterion, and we count the frequency of selected 
variables. Since we observed that the BIC minimum usually occurs when around 50 variables 
are selected, in order to reduce computation, we use a selection window such that we choose 
the first 50 predictors in the model for each resampling. The probability/frequency of being 
selected for 200 predictors are plotted in Figure |4j Use 60% as the selection threshold, we 
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will choose 18 predictors based on Figure [|| If we use the threshold probability 80%, we can 
exactly choose the true 16 key bytes. Through resampling from a large dataset, we are able 
to simulate selection probability, and this process of variable selection is more powerful than 
simply relying on one fixed data set using usual criteria. 

6.2 Gene Selection in Pathway Data 

We next apply our Gaussian kernel NGK method to a set of diabetes data from Mootha et 
al. (2003). They provided pathway based analysis to classify two phenotypes, 17 normal and 
18 Type II diabetes patients. A pathway is a predefined set of genes that serve a particular 
cellular or physiological function. They showed that pathway based analysis can detect 
coordinate subtle changes among a set of genes. It is known that genes in a pathway are 
not independent of one another and interact with unknown structure. The top significant 
pathways related to the diabetes disease have been identified (Mootha et al., 2003). Pathway 
133 ("Oxidative phosphorylation"), pathway 4 ("Alanine and aspartate metabolism") and 
pathway 140 ("MAP00252-Alanine-and-aspartate metabolism") are three interesting ones 
which contain a total of 58, 18 and 22 genes, respectively. 

For each pathway we label the genes by their appearance index, gene 1, gene 2 ... and so 
on. Note the same gene index from two pathways does not imply the same gene. Since the 
18 genes in pathway 4 are all included the 22 genes in pathway 140, we use the gene index 
of pathway 140 to label genes in both pathways. Thus gene 4, 5, 19 and 20 do not appear in 
pathway 4. Hence, in this application, the data set structure is (y, X) with a total of n = 35 
observations and p = 58, 18 and 22 predictors, respectively. The response is the outcome of 
glucose levels. 

Figures [5^a), (c) and (e) plot the solution paths of the £j's corresponding to genes for three 
pathways. Figures |5[b), (d) and (f) show the BIC curves to select genes where a total of 13, 7 
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and 9 genes are selected, respectively. The index sets for the selected genes of the three path- 



ways by the Gauss kernel NGK method are A33 = {1, 4, 5, 14, 19, 23, 29, 31, 34, 41, 51, 53, 57}, 
Aa = {8,10,11,12,13,14,21} and Auo = {5,8,10,11,12,13,14,18,21}. However, as dis- 



able selection depending on single draw may not be powerful even if the observation number 
is large. In the diabetes data, there are only 35 observations. So we need additional steps to 
increase selection power. In this section we propose using a residual permutation procedure 
to repeat the variable selection process and counting the total frequency/probability of each 
predictor. 

• Step 1 : Apply the Gaussian kernel NGK variable selection method to the original 
dataset using the backfitting algorithm introduced in Section [3j obtain the selected 
variables £ = (0)J g ^- Use £ to fit the Gaussian kernel machine again to obtain new ex. 
and new Ao by REML such that y = K(£)a. Obtain the residual e = y — y. Center e 
by subtracting its mean. 

• Step 2: Permute the residual e to get new e* and simulate outcomes as y* = K(£)ac+e* . 

• Step 3: Based on the new dataset (y*,X) with fixed initial ex and fixed Ao, apply the 
NGK variable selection method again and obtain the selected gene set. 

• Step 4- Repeat Steps 2-3 for a large number of iterations (e.g. 3000 times). 

• Step 5: Obtain the empirical probability/frequency of selecting each variable. 

The results of NGK permutation procedure are summarized in Figure [6j If we take 
60% as the threshold, the sets of genes selected are A{ 33 = {4,5,14,19,23,31,34,41,53}, 
A\ = {8, 10, 11, 12, 21} and A* im = {5, 12, 21} respectively. Because pathway 4 is a subset 
of pathway 140, we plot the results of the two pathways in one plot, Figure [61(b) . Compare 
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with A, we see that A* C A. For example, for pathway 133 four extra genes selected using 



a single NGK step are {1, 29, 51, 57}. Especially for gene 1, the selection probability is less 
than 20% by permutation approach. 



Interesting observations for pathway 4 and pathway 140 are found in Figure M 



First, some of the genes are not significantly related to the response, such as genes 
{1,2,3,7,9,15,22}. In both pathways, the selection probabilities remains small for those 
genes. Another observation is that some genes are significantly related to the response and 
retain a high selection probability in both pathways (gene 21, for example). Furthermore, 
some genes seem to interact with one another. For example, genes {10, 11, 12, 13, 14} appear 
to group a gene segment with similar selection probability. An interesting gene is gene 5, 
which does not appear in pathway 4. Gene 5 has the highest selection probability in pathway 
140. While gene 5 is present in pathway 140, the selection probabilities of {8, 10, 11} are 
smaller than in pathway 4. This indicates some interaction may occur between gene 5, gene 
8 and the gene segment {10, 11}. 

7 Discussion 

In this paper we have proposed a new variable selection approach to recover sparsity of 
the multivariate input variable in a nonadditive smoothing function. Our approach can 
be addressed as a nonnegative garrote variable selection procedure with kernel machine. 
The method we proposed has several advantages: (1) it can recover sparsity as well as 
model any order interactions automatically; (2) it is applicable not only to nonadditive 
smoothing functions, but also to additive model by choosing a different kernel; and (3) it 
establishes a connection among several existing methods including linear nonnegative garrote, 
kernel learning and ARD problems. We have also developed an efficient coordinate descent 
updating procedure for the scale parameters £j's which inherits the nice properties of the 
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regular backfitting method and can replaces the LARS algorithm in models with additive 
multiple kernels. 

The results in this paper show some theoretical properties similar to linear LASSO and 
linear nonnegative garrotes. However, other theoretical properties require further study, such 
as the convergence rate of the coordinate descent algorithm and the performance of model 
selection criteria such as BIC in least squares error kernel machines. Furthermore, in this 
paper, we suggested resampling variable selection procedures in two cases: when n is large 
and when n is small. Thus consistency and convergence rate of resampling/bootstrapping 
on NGK approaches are interesting future topics as well. 

Possible extensions of our method include applying NGK approaches to generalized linear 
models (GLM). Logistic kernel machine regression or multiple categorical classification are 
popular for many applications. Selecting input variables using NGK applied to GLMs is 
challenging work as the link functions are nonlinear too. 

Another interesting extension of our method is consideration of more complicated kernel 
structures. To illustrate this, we could consider a dataset with q multivariate variables 
such as q genetic pathways, each one containing multidimensional genetic expressions and 
potential genes sharing between pathways. Thus the kernel could be expressed as K = 
piK 1 (£ 1 ,X 1 )+p 2 K2(£ 2 ,X2) + ...,p q K q (£ q ,X q ). Applying penalty 011/9 = (p u ...,p q ) T and on 

...,£ q , NGK may be able to recover sparsity of the X/s as well as the additive functional 
components of {/_,■ = i^a}'s. This might be considered as group NGK and we can apply it 
to selecting pathways and interactions from a pathway pool. 
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Appendix A 

A.l Proof of Lemma 2 

Proof: This is a result of the composition theorem (Boyd and Vandenberghe 2004 Chp.3). 
For function Q Q (£) = Q (K{£)) with domain dom Q (^) = e dom K{£)\K{£) G 
dom Qq(K)}, if Qq(K) is convex and non-increasing and K(£) is concave, then Qo{$) is con- 
vex. To see this, assuming £' G dom Qo(£)> we have G dom if (£) and K(£), K(£') G 
dom Qq(K). Since dom if(£) = is convex, we have 9$, + (1 — G dom K(£) and, 
from concavity of if(£), we have 

if (0£ + (1 - 0)?) h OKU) + (1 - 9)K{£'). (A.l.l) 

Since if(£),if(£') G K* C dom Q„(*0 = K, we conclude that 6K{$) + (1 - 0)if(£') G 
dom Q (-?0- Since 0£ + (1 - 0)£' G dom if(£), we have if(0£ + (1 - 6)£') G dom Q (K) 
too, which means 0£ + (1 — 9)g G dom Qo(£)- Now, using the fact Qo(K) is nonincreasing 



and (A.l.l), we have 

Qo {K(9£ + (1 - 5)0) < Qo {0K{£) + (1 - 0)if (£')) • (A.1.2) 
Because of the convexity of Qq(K), we have 

Qo (0if (£) + (1 - 0)Ktf)) < 9Q (#(£)) + (1 - 0)Q O (if (O) • (A.1.3) 
Combining the above two inequations, we get 

Qo (0* + (1 - 0)0 < 0Qo (€) + (1 - 0)Qo (O , (A.1.4) 

which proves the convexity of Qo(£). Since ||£||i is a convex function of £, this implies 
convexity of Q(£). 
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A. 2 Proof of Theorem 1 



Proof: We use the expression (12) to prove Theorem 1. Following Fan and Li (2001), to 
show the existence of a d n = -nT ^-consistent local minimum in the ball + d n u : ||u|| < C, 
we need to show that for any given e > 0, there exists a large enough constant C such that 

(A.2.1) 



liminf P { inf Q(£* + d n u) > Q(£*) } > 1 - e. 

||u||=C 



To show that, we first calculat following expression: 
Q(t + d n u) - Q(t) 

T J2 



dr. 



dQ 



> v^A d n v n (C) T u + ^^u T M n (^*)u + nd n X £ (|& + rf r 



(A.2.2) 



l««l) 



i=l 



> A v^(r )u + yU T M„(r )u - v^Aa||u| 



Using the regularity conditions of (17), we note that v„ = O p (l). Thus in the right hand side 



of (A.2.2), the first term is uniformly bounded by second term for C sufficiently large. To see 



this, at || u || = C, 0.5u T M n u is uniformly larger than 0.5A m j n (M„)C 2 which is a quadratic 
function of C because M n is finite positive (semi-)definite. And ||v^u|| < ||v^||C which is 
linear function of C since ||v^|| = O p (l). For sufficiently large C, the quadratic form of C 
always dominates the linear form of C. As n — > oo, we assume y/n\ — > 0, thus the last term 



is also bounded by the second term. Hence by choosing a sufficiently large C, (A.2.1) holds. 



A. 3 Proof of Lemma 3 



Proof: To continue the proof, (23) is a convex function of £ by the Karush-Kuhn- Tucker 



conditions for optimality in a convex program, the point £ £ is optimal if and only if 
there exists a subgradient k £ <9(||£||i) such that 

dQ 



+ nXk = 0, 



(A.3.1) 
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where Qq is the first two terms of (22). The collection of subgradient of ||£||i at point £ is 
the subdifferential f?(||£||i): 



£||i) = {k E W : kj = 1 for > 0; % < 1 for = 0}. 



(A.3.2) 



Plugging back into (23) and operating simple algebra, we have 



Z T Z{£ - C) - Z T e + -j-Z T a* + n\k = 0. 



(A.3.3) 



Suppose lim n P(^4 = A) — > 1 or lim„P(sgn(£) = sgn(£*)) — > 1, thus 



ti >- 0) £o — 0, and & x = 1, k ^ 1. 



Substituting these observations and rearranging (23), we have 



1 ryT ( ^0 J 

— Zf e ck 

rz 1 V 2 



Al 



Ak , 



n 



1 7 T / -V) , 

— Z-i e ck 

ra 1 V 2 



Al 



(A.3.4a) 



(A.3.4b) 



Considering conditions (A.3.2) for fc an d £i >- 0, we have the sufficient and necessary 



conditions of (21a) and (21b) 



A. 4 Proof of Theorem 2 

Proof: Condition \a— ct*|oo = O p (S) implies that |afcaj— a£a:*| < ( | otfe | + 1 o:* a*\ = O p (S) 
for 1 < k, I < n, thus we have the relationships 



-1~T~ 

n z i Zj = n 



-^• + 0,(5), 



n 1 z T j ot =n~ 1 zja* + O p (6), 



(AAA) 
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where 1 < i, j < p. These relationships are derived from the following two inequalities: 



n 



~t~ _ t \ 

Z i Z j Z i Z j J 



< 



< 



n 3 
I k,l 
K k,l 



ik,l{ a k a l - a k a l. 



n 

O p {8) 



n 



\k,l\ a k®l - ®k a l 



k,l 



(A.4.2) 



n 



< O p (5) ■ C = O p {8) 



and 



n 



< 



77, 



(A.4.3) 



n 

<O p (5)-C = O p (<5), 

where the C's are some small positive numbers. These inequalities are true for Gaussian and 
linear polynomial kernels because X is standardized. For example, for a Gaussian kernel, we 
have n _1 l T |Aj|l < 2 where | • | is the componentwise absolute value. To see this, first note 
that 1 T |A'|1 = 1 T \K oD°\l < 1 T |Z>?|1 since all elements of K are positive and smaller than 



1, and 1 T |D J |1 = J2ki( x jk 



x 



jl) 



T, k ,i( x % + x )i - 2x jkXji) = J2k( nx % + 1) = 2n - For 



a linear polynomial kernel, K'- = D 3 = x^xj. Thus l T |Aj|l = J2ki \ x jk x jl\ < \ x ji\) 2 — 
nY^i x \ — n - 111 both cases we use Yl,i x ji = an d = 1- Similarly we can show that 



the inequalities for n 1 l T \K' i K'Al are bounded by some small numbers. 



In addition, from conditions (25a)-(25c) and the relationships (A. 4.1), with 5 — > 0, the 



left hand side of (25a) only differs from SoiS u 1 — yjZqPo. by O p (5) (P is the projection 
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matrix and thus does not change much the norm). For 5 sufficiently small, 



2n\ 



Z+ Ptx ^(1-7)1 



(A.4.4) 



holds for some 7 G (0, 1]. The converse is also true: if A.4.4 satisfied, then we can always 



find a small positive 7 when S — > such that the condition (25a) is true. This equivalence 



allows us to show sparsistency by using (A.4.4) 



Starting from (A.4.4), our argument is based on the technique of a primal dual witness 



on model selection, consistency of the lasso which contains the following steps (Wainwright, 
2009): 

1. Obtain £ x by solving (24b), and set £ = 0, 

2. Set k\ = d(\\£l\\i), for our model with nonnegative garrotte k\ = 1, 



3. With these setting of $ >1 and kx, obtain k through (24a), and check whether or not 



k £ <9(ll£olli)' f° r f° r our m °del with nonnegative garrotte k -< 1, 



4. Check whether k\ = 1. 



Lemma 2 in Wainwright (2009) states that if dual feasibility is established (Step 1-3 succeed), 
then A C A. In Step 3 using k -< 1 instead of k ^ 1 ensures uniqueness by strict dual 
feasibility. Furthermore, if Step 4 succeeds as well, then A = A. 

Following Wainwright (2009), Theorem 2 is proved in two steps. Given a and Z defined 



as before, from (24a 24b), we define two random variables: 



Ai :-- 



z x {z T x z x y l \ 



where ej is the selection vector with 1 in the jth position. 
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Dual feasibility 

Write At as E(At) + A*, where E(Ai) = zf \z x {ZlZ x Y x \ - ^Pa}, and A* 
1 -zJ Pe. To have the subgradient vector Kq ^ 1 is equivalent to showing 



nX i 



max A < 1. 



Using the definition of A, and condition (A.4.4), we have 



max < (1 — 7) + maxi*. 

i i 

A* is a zero mean sub-Gaussian random variable and, according to Wainwright (2009), 

the variance of A* is bounded by 

2 22 



which can be shown using the relationship (A. 4.1), the properties of the projection 



matrix, and normalized Zj vector such that ||zj||| < n, and 8 — > 0. 

By the sub-Gaussian tail bound results combined with the union bound (Wainwright 
2009), we have 

P f max A* > i] < (p - o) exp ( - ^ 7 r- r =rX 

\i l ~2j- Ky ' y \ 2a 2 \- 2 n- 1 [l + O p (S)) J 

= exp |-^r(l + O p {5))- 1 + log(p - a) 
Putting all these parts together, we conclude that 

P ^max > 1 — < exp (— r]i\ 2 n). 

If we choose some A such that 8er2 ( 1H ?o ^ > log(p — 0)1 say 

X>lJ^^m% (A.4.5) 
7 V n 

the probability for {maxj Aj > 1 — 7/2} vanishes with rate exp(— i]i\ 2 n) as n — >■ 00. 
Or in other words, with probability 1 — exp(— i]i\ 2 n), we have A C A 
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Bounding — £*|| 
The upper bound of ^ 1 — $,1 



is 



|£l £ll|oo < 



Sn 1 ( -Zfe 

n 

-v 

1 



A||S 



-l II 

11 ||oo 



11 



Note the oo— norm of matrix E xl is bounded as 



(A.4.6) 



1^11 II oo < V^^min 



(A.4.7) 



Thus, part III is bounded as A||S X1 ||oo = y/aXC m l n . 



Part II can be bounded as 
II : 



< WfrH 

— II 11 ll°o 



AO rvT ~ 

2n 1 



if]- 1 II 

I ^11 II oo 



A 







n \ 2\fn 



n. 



where we use (A. 4.1) and (18) for v*. Using (A.4.7) we have 



(A.4.8) 



IK ^ 



aA 



Cn 



n 1/2 max | v* \ + P (S) 



(A.4.9) 



Note that in £j — the random portion is Uj := ejE lx (n 1 Zfe) with e ~ N(0, cr 2 /) 
so Uj is zero mean Gaussian, i.e. E{Uj) = 0, and 



(T 2 ~ 



(A.4.10) 



Again using the sub-Gaussian tail bound (Wainwright 2009), we have 
P f max \Uj\> t] < 2 exp ( " . + log a 



2 exp 



2a 2 



C min + log a . 



Setting i = 4crAC m ^ 2 , and by choosing A as in (|A.4.5|), we have 8A 2 n > logp > logo 



(A.4.11) 



so that P (maxj \Uj\ > 4aXC m }l 2 J — > with rate at least 2exp(— r] 2 X 2 n) where rjz > 0. 
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And we are bounding 

ll£l — €l lloo < A 

< A 



4(7 



+ \\^u\ 



mm 

4a 



Ao 
A 



^n~ 1/2 max \v*\ + O p (<5)^ + 



\n\/a 



\fc n 



ACr, 



-1/2 



max |^*| + O p (5) ) +-J 

Or, 



with probability 1 — 2 exp(— r] 2 X 2 n). 

From the bounding expression, we can see that if we have A + \ T</aJn + O p (\ y/a5) + 
y/a\ — > and A 2 n — > oo, the we have £ x — > £* with probability 1. 
Furthermore define 



P(A) = A 



4a 



V ^ C m i r 



+ l|£nl 



Ao 
A 



v „ (n- 1 / 2 ||v*|| 00 + O p ( < 5)) + ||Eri 1 | 



Hence we finally conclude that, as A 2 n — > oo, if p(A) < min^-g^*, then we have all 
£j > 0, j G .4., thus establishing the sign consistency A = A. 
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Table 1: Simulation results of Simulation Example 1 for 200 runs. "NGK Gauss" and 
"NGK Poly" represent NGK method with Gaussian kernel and with linear polynomial kernel, 
respectively. 







FP-rate 


FN-rate 


MS 


RSS 


SE 




NGK Gauss 


0.12(0.11) 


0.20(0.15) 


4.46(1.77) 


1.29(0.71) 


0.55(0.63) 


n = 64 


NGK Poly 


0.08(0.10) 


0.20(0.12) 


4.26(1.15) 


0.92(0.19) 


0.14(0.07) 




COSSO 


0.09(0.10) 


0.19(0.12) 


4.42(1.24) 


1.20(0.30) 


1.01(0.18) 




NGK Gauss 


0.09(0.09) 


0.22(0.14) 


3.98(1.56) 


1.13(0.36) 


0.26(0.33) 


n = 128 


NGK Poly 


0.06(0.08) 


0.21(0.12) 


3.96(1.07) 


0.96(0.12) 


0.07(0.04) 




COSSO 


0.05(0.08) 


0.21(0.12) 


3.87(1.10) 


1.04(0.15) 


1.00(0.12) 


n = 256 


NGK Gauss 
NGK Poly 


0.07(0.09) 
0.05(0.08) 


0.22(0.15) 
0.24(0.10) 


3.83(1.65) 
3.68(0.98) 


1.10(0.31) 
0.96(0.08) 


0.16(0.29) 
0.04(0.02) 




COSSO 


0.04(0.07) 


0.18(0.13) 


4.03(1.07) 


1.00(0.10) 


1.00(0.09) 



Table 2: Simulation results of Simulation Example 2 for 200 runs. 
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n = 128 


NGK Poly 


0.04(0.07) 


0.00(0.00) 


5.23(0.46) 


1.20(0.15) 


0.31(0.05) 




COSSO 


0.01(0.04) 


0.00(0.00) 


5.06(0.27) 


0.95(0.14) 


1.02(0.13) 




NGK Gauss 


0.01(0.03) 


0.00(0.00) 


5.04(0.18) 


1.12(0.12) 


0.20(0.05) 


n = 256 


NGK Poly 


0.01(0.03) 


0.00(0.00) 


5.03(0.17) 


1.22(0.11) 


0.29(0.03) 




COSSO 


0.01(0.03) 


0.00(0.00) 


5.05(0.21) 


0.98(0.09) 


1.01(0.09) 



Table 3: Simulation results of Simulation Example 3 for 400 runs. 





FP-rate 


FN-rate 


MS 


RSS 


SE 


NGK Gauss 

n = 64 

NGK Poly 


0.08(0.04) 
0.08(0.11) 


0.003(0.022) 
0.030(0.110) 


11.88(4.21) 
12.52(14.5) 


1.56(0.30) 
1.27(1.53) 


1.01(0.20) 
0.87(1.37) 





Figure 1: (a) Incoherence condition values vs. A with Ao fixed at 0.0026, (b) Incoherence 
condition values vs. Ao with A fixed at 1.516, (c) solution path of fy's for linear LASSO, and 
(d) solution path of &'s for NGK. All plots use initial 6t = A -1 (£) with | = (1, 1, 1) T . 




Figure 2: Selected example of NGK solution path for Simulation Example 3 using Gaussian 
kernel, (a) and (b), and linear polynomial kernel, (c) and (d). Left side: £/s vs. L\ norm of 
£j's, Right side: £j's and BIC vs. log A. 
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Figure 3: Selection probability of each predictor in Simulation Example 3 for 400 runs using 
two NGK methods. 
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Figure 4: Selection probability of each key guess of SCA data using m-out-of-n resampling 
procedure, m = 2048, n = 5120 and total 1200 runs. 
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Figure 5: NGK solution paths and BIC curves for diabetes data pathway 133, 4, and 140 
using Gaussian kernel. Left side: f^-'s vs. L\ norm of £/s, Right side: £/s and BIC vs. log A. 
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Figure 6: Selection probability of each gene using the residual permutation method for 
pathway 133 (a), and pathway 140 and 4 (b), with a total of 3000 runs for each pathway. 



